POLI 572B

Michael Weaver

February 24, 2020

Multivariate Least Squares

Objectives

Recap

  • Multivariate Least Squares mathematical properties

Interpretation

  • Best Linear Approximation of CEF
  • Dummy Variables
  • Slopes and Intercepts
  • Multicollinearity

Today

  • Least squares is an algorithm with mathematical properties:
    • what is the algorithm doing?
    • what is mathematically true of the algorithm?
  • How does the algorithm relate back to the conditional expectation function?
  • Interpreting simple scenarios:
    • dummy variables
    • slopes and intercepts

Recap

Least Squares

\[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{\beta}\]

This is the matrix formula for least squares regression.

If \(X\) is a column vector of \(1\)s, \(\beta\) is just the mean of \(Y\). (We just did this)

If \(X\) is a column of \(1\)s and a column of \(x\)s, it is bivariate regression.

We can now add \(p > 2\): more variables

\[\mathbf{X} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{pmatrix}; \mathbf{Y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}\]

Multivariate Least Squares:

Matrix equation fits mean of \(Y\) as a linear function of many variables:

\[Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \ldots + b_k X_{ki}\]

Mathematical Requirements:

  1. Matrix \(X\) has “full rank”
  • This means that all of the columns of \(X\) are linearly independent:
    • cannot have two identical columns
    • cannot have a set of columns that sums to another column multiplied by a scalar
  • If \(X\) is not full rank, \(\mathbf{X}'\mathbf{X}\) cannot be inverted, so we cannot do least squares.
  2. \(n \geq p\): we need at least as many data points as variables in our equation
  • no longer trivial with multiple regression
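A minimal sketch of the full-rank requirement, using simulated (hypothetical) data: when one column of \(X\) is a linear combination of the others, \(\mathbf{X}'\mathbf{X}\) is singular and cannot be inverted.

```r
# Third column is exactly half the second: X is rank-deficient
X <- cbind(1, c(2, 4, 6, 8), c(1, 2, 3, 4))
qr(X)$rank        # 2: less than ncol(X) = 3, so X is not full rank
XtX <- t(X) %*% X
det(XtX)          # 0 (up to floating point): X'X is not invertible
# solve(XtX) would fail with "system is computationally singular"
```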

Key facts about regression:

The mathematical procedures we use in regression ensure that:

\(1\). The mean of the residuals is always zero: \(\overline{e} = 0\). Because we included an intercept (\(b_0\)), the regression line goes through the point of averages. This was also true of residuals from the mean.

\(2\). \(Cov(X,e) = 0\). This is true by definition of how we derived least squares.

\(3\). \(\overline{Y} = \overline{X}'\beta\): the predicted value \(\hat{Y}\) when all variables in \(X\) are at their mean is the mean of \(Y\).

\(4\). Slope \(b_k\) captures variation in \(Y\) due to variation in \(X_k\) that is orthogonal/uncorrelated to all other columns of \(X\).
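The first three facts are easy to check numerically; a quick sketch on simulated data (all variable names here are hypothetical):

```r
set.seed(5)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)
m <- lm(y ~ x)
e <- residuals(m)

mean(e)                                        # fact 1: ~ 0
cov(x, e)                                      # fact 2: ~ 0
predict(m, newdata = data.frame(x = mean(x)))  # fact 3: equals mean(y)
```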

Multivariate Least Squares:

\[Y = b_0 + b_1 X_1 + b_2 X_2 + \ldots + b_k X_k\]

\(b_k = \frac{Cov(X_k^*, Y)}{Var(X_k^*)}\)

where \(X_k^* = X_k - \hat{X}_k\), with \(\hat{X}_k\) obtained from the regression:

\(X_{k} = c_0 + c_1 X_{1} + \ldots + c_{j} X_{j}, \quad j \neq k\)

\(X_k^*\) is the residual from regressing \(X_k\) on all other \(X_{j \neq k}\)
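This “partialling out” result can be verified directly. A sketch with simulated data (variable names hypothetical):

```r
set.seed(1)
x1 <- rnorm(100)
x2 <- rnorm(100)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(100)

# Coefficient on x2 from the full multivariate regression:
b2_full <- coef(lm(y ~ x1 + x2))["x2"]

# Residual of x2 after regressing it on all other columns (here, just x1):
x2_star <- residuals(lm(x2 ~ x1))

# Cov(X_k*, Y) / Var(X_k*) recovers the same slope:
b2_fwl <- cov(x2_star, y) / var(x2_star)
all.equal(unname(b2_full), b2_fwl)   # TRUE
```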

Multivariate Least Squares:

How do we make sense of \(b_k = \frac{Cov(X_k^*, Y)}{Var(X_k^*)}\), where \(X_k^*\) is the residual of \(X_k\) after regressing on all other \(X_{j \neq k}\)?

  • The slope \(b_k\) is the change in \(Y\) with a one-unit change in the part of \(X_k\) that is uncorrelated with/orthogonal to the other variables in the regression.

The CEF

Regression and CEF

How does multivariate regression relate back to Conditional Expectation Function?

  • Angrist and Pischke give justification
  • least squares is the best linear approximation of the conditional expectation function \(E(Y_i | X_i)\)

Least Squares and CEF

Least Squares predictions given by \(\widehat{Y_i} = X_i'\beta\) give the linear approximation of \(E(Y_i | X_i)\) that has the smallest mean-squared error (the linear approximation with minimum distance to the truth).

  • Least Squares also fits a line for \(E(Y_i|X_i)\), not just a line for \(Y_i\). We could even use \(E(Y_i | X_i)\) as the dependent variable.

Exercise: Download the data grouped by AGE in years

if (!require(httr)) install.packages('httr')
library(httr)  # ensure httr is loaded after a fresh install

r = GET("http://mdweaver.github.io/poli572b/lecture_7/example_7.csv")
cef = as.data.frame(content(r))

Exercise:

Use matrix calculations in R to obtain linear approximation of \(E(Y_i | X_i)\):

  • What is the intercept? What does it mean?
  • What is the slope? What does it mean?
#With individual data...
X = cbind(1, acs_data$AGE)
y = acs_data$INCEARN

beta = solve(t(X) %*% X) %*% t(X) %*% y

beta
##           [,1]
## [1,] 23404.561
## [2,]  3535.247

Do your estimates agree? Why or why not?

We are not accounting for the number of people in each value of \(X_i\)

#With aggregate data...
X = cbind(1, cef$AGE)
y = cef$INCEARN
w = diag(cef$respondents)  # diagonal weight matrix of group sizes
beta = solve(t(X) %*% w %*% X) %*% t(X) %*% w %*% y

beta
##           [,1]
## [1,] 23404.561
## [2,]  3535.247

Least Squares and CEF

We can reproduce any individual-level least squares by:

  1. Collapsing the data to groups defined by the unique values of \(X_i\): \(X_g\) for \(g \in G\)
  2. Take the group mean of \(Y_i\): \(Y_g\)
  3. Regress \(Y_g\) on \(X_g\), weighting by group size \(n_g\).

We are directly modelling the conditional expectation \(E(Y_i | X_i)\) as a linear function.
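A sketch of this equivalence on simulated data (variable names hypothetical): the weighted regression on group means reproduces the individual-level fit.

```r
set.seed(2)
x <- sample(18:65, 500, replace = TRUE)  # e.g., age in years
y <- 10 + 0.5 * x + rnorm(500)

b_indiv <- coef(lm(y ~ x))               # individual-level regression

g   <- tapply(y, x, mean)                # group mean of y at each value of x
xg  <- as.numeric(names(g))              # the unique values of x
yg  <- as.vector(g)
n_g <- as.vector(table(x))               # group sizes

b_group <- coef(lm(yg ~ xg, weights = n_g))  # weighted group-level regression
all.equal(unname(b_indiv), unname(b_group))  # TRUE
```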

Dummy Variables

Dummy Variables

What are they?

  • Binary variables taking \(1\) if the observation has some attribute, \(0\) otherwise.

  • Can be used for Yes/No categorical variables
  • Can be used for variables with many categories (e.g. race, marital status, etc.)

How do we make them?

  • Easy to do in R with as.factor(): each unique value will get a dummy.

lm(y ~ as.factor(any_variable))

Dummy Variables

m1 = lm(INCEARN ~ FEMALE, acs_data)
  1. What does the design matrix \(X\) in this regression look like?
  2. What does the \(b_0\) tell us?
  3. What does the \(b_1\) tell us?

Dummy Variables

m2 = lm(INCEARN ~ FEMALE + MALE, acs_data)
  1. What does the design matrix \(X\) in this regression look like?
  2. What does the \(b_0\) tell us?
  3. What does the \(b_1\) tell us?
  4. What does the \(b_2\) tell us?

Dummy Variables

m3 = lm(INCEARN ~ -1 + FEMALE + MALE, acs_data)
  1. What does the design matrix \(X\) in this regression look like?
  2. What does the \(b_0\) tell us?
  3. What does the \(b_1\) tell us?
#Design matrix
model.matrix(m1) %>% head
##   (Intercept) FEMALE
## 1           1      1
## 2           1      0
## 3           1      0
## 4           1      1
## 5           1      0
## 6           1      0
#Interpret:
coefficients(m1)
## (Intercept)      FEMALE 
##   203342.42   -61756.51
#Design matrix
model.matrix(m2) %>% head
##   (Intercept) FEMALE MALE
## 1           1      1    0
## 2           1      0    1
## 3           1      0    1
## 4           1      1    0
## 5           1      0    1
## 6           1      0    1
#Interpret:
coefficients(m2)
## (Intercept)      FEMALE        MALE 
##   203342.42   -61756.51          NA
#Design matrix
model.matrix(m3) %>% head
##   FEMALE MALE
## 1      1    0
## 2      0    1
## 3      0    1
## 4      1    0
## 5      0    1
## 6      0    1
#Interpret:
coefficients(m3)
##   FEMALE     MALE 
## 141585.9 203342.4

Dummy Variables

What do they do?

If there is an intercept

  • one category must be excluded; otherwise there is linear dependence. The excluded category is the “reference group” (R excludes one by default)
  • slope is the difference in means between the group indexed by the dummy and the “reference group” (whose mean is the intercept)

If there is no intercept:

  • all categories must be included
  • fits separate intercepts for each group: coefficient is the mean of the group indexed by the dummy (when all other variables are \(0\)).
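Both cases can be illustrated with simulated data (variable names hypothetical):

```r
set.seed(3)
female <- rbinom(200, 1, 0.5)
income <- 50 + 10 * female + rnorm(200)

m_int   <- lm(income ~ female)               # with intercept
m_noint <- lm(income ~ -1 + factor(female))  # no intercept, both dummies

# With intercept: b0 = reference group mean, slope = difference in means
coef(m_int)["(Intercept)"]   # equals mean(income[female == 0])
coef(m_int)["female"]        # equals mean(income[female == 1]) - mean(income[female == 0])
# Without intercept: each coefficient is a group mean
coef(m_noint)
```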

Example:

How did different regions in the UK vote on Brexit?

Download the data:

if (!require(httr)) install.packages('httr')
## Loading required package: httr
library(data.table)  # needed for as.data.table()
r = GET("http://mdweaver.github.io/poli572b/lecture_7/analysisData.csv")
brexit = as.data.table(content(r))
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Regn_Cd = col_character(),
##   Region = col_character(),
##   Area_Cd = col_character(),
##   Area = col_character()
## )
## See spec(...) for full column specifications.

Example:

table(brexit$Region)
## 
##                     East            East Midlands                   London 
##                       47                       40                       33 
##               North East         Northern Ireland               North West 
##                       12                        1                       39 
##                 Scotland               South East               South West 
##                       32                       67                       37 
##                    Wales            West Midlands Yorkshire and The Humber 
##                       22                       30                       21
hist(brexit$Pct_Lev)

Do the following:

Using the lm function:

  1. Regress Pct_Lev on Region
  2. Interpret the coefficients returned (summary(your_model))
  3. Regress Pct_Lev on -1 + Region
  4. Interpret the coefficients.
  5. Without looking at your_model$fitted.values, what can you say about the \(\widehat{Y}\) produced by these two regressions?

Dummy Variables

## 
## Call:
## lm(formula = Pct_Lev ~ RegionU, data = brexit)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -30.8121  -4.7843   0.4257   5.1457  30.5685 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     39.13687    1.39400  28.075  < 2e-16 ***
## RegionULondon                   -0.04536    1.95643  -0.023    0.982    
## RegionUNorth West               16.77825    1.88088   8.920  < 2e-16 ***
## RegionUYorkshire and The Humber 19.51312    2.21458   8.811  < 2e-16 ***
## RegionUNorth East               20.34229    2.66931   7.621 2.15e-13 ***
## RegionUWest Midlands            21.17779    2.00400  10.568  < 2e-16 ***
## RegionUEast Midlands            20.43762    1.87025  10.928  < 2e-16 ***
## RegionUSouth West               14.54745    1.90365   7.642 1.87e-13 ***
## RegionUEast                     17.82525    1.80729   9.863  < 2e-16 ***
## RegionUSouth East               13.03312    1.69451   7.691 1.34e-13 ***
## RegionUWales                    14.21085    2.18398   6.507 2.50e-10 ***
## RegionUNorthern Ireland          5.08312    8.00793   0.635    0.526    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.886 on 369 degrees of freedom
## Multiple R-squared:  0.443,  Adjusted R-squared:  0.4264 
## F-statistic: 26.68 on 11 and 369 DF,  p-value: < 2.2e-16

How is this different from what you produced?

Dummy Variables

During the Brexit vote, many areas thought to support “Remain” experienced heavy rainfall. total_precip_prepolling is the number of mm of rain that fell just before polls were open.

  1. Regress Pct_Lev on total_precip_prepolling + Region.
  2. Thinking about the least squares estimator: how is the slope on total_precip_prepolling calculated in this case? (think of \(X_k^*\))
  3. In light of \(2\), what does the slope on total_precip_prepolling mean?
## 
## Call:
## lm(formula = Pct_Lev ~ total_precip_prepolling + Region, data = brexit)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -27.7877  -4.7351   0.4749   4.7649  28.7536 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     52.55803    1.49590  35.135  < 2e-16 ***
## total_precip_prepolling          0.05186    0.01165   4.451 1.13e-05 ***
## RegionEast Midlands              7.01178    1.92738   3.638 0.000314 ***
## RegionLondon                   -19.26625    1.77485 -10.855  < 2e-16 ***
## RegionNorth East                 6.92113    2.67736   2.585 0.010120 *  
## RegionNorthern Ireland          -8.33820    7.83610  -1.064 0.287993    
## RegionNorth West                 3.35710    1.93773   1.732 0.084024 .  
## RegionScotland                 -13.42789    2.02081  -6.645 1.10e-10 ***
## RegionSouth East                -3.11209    1.51142  -2.059 0.040193 *  
## RegionSouth West                 1.11001    1.95693   0.567 0.570911    
## RegionWales                      0.78957    2.21970   0.356 0.722260    
## RegionWest Midlands              7.75486    2.05162   3.780 0.000183 ***
## RegionYorkshire and The Humber   6.09083    2.24826   2.709 0.007061 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.692 on 368 degrees of freedom
## Multiple R-squared:  0.4715, Adjusted R-squared:  0.4542 
## F-statistic: 27.35 on 12 and 368 DF,  p-value: < 2.2e-16

Dummy Variables

You can change the “reference group” like this: the code below shifts “Scotland” to the reference group by making it the first “level” of the factor.

levels = c('Scotland', 'London', 'North West', 'Yorkshire and The Humber', 'North East', 'West Midlands', 'East Midlands', 'South West', 'East', 'South East', 'Wales', 'Northern Ireland')
brexit[, RegionU := factor(Region, levels = levels)]
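An equivalent shortcut uses relevel(), which moves one level to the front without retyping the full ordering. A small self-contained sketch (region names hypothetical):

```r
# relevel() only changes which category comes first (the "reference group")
region <- factor(c("London", "Scotland", "Wales", "Scotland"))
levels(region)                         # "London" "Scotland" "Wales"
region2 <- relevel(region, ref = "Scotland")
levels(region2)                        # "Scotland" "London" "Wales"
```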

Interpretation

Example

Let’s say we want to get the conditional expectation function for yearly earnings of professionals as a function of hours worked, profession, gender, and age.

Using data from the American Community Survey, we can look at how hours worked is related to earnings for doctors and lawyers.

  • UHRSWORK (the usual hours worked per week)
  • FEMALE (an indicator for female)
  • AGE (in years)
  • LAW (an indicator for being a lawyer)
  • INCEARN (total annual income earned in dollars).

Example

Let’s now find the (approximate) linear conditional expectation function of earnings, across hours worked, age, profession, and gender:

ls = lm(INCEARN ~ UHRSWORK + AGE + LAW + FEMALE, acs_data)

Example

## 
## Call:
## lm(formula = INCEARN ~ UHRSWORK + AGE + LAW + FEMALE, data = acs_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -291485  -89191  -28271   71281 1067525 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9929.8     9370.4    1.06    0.289    
## UHRSWORK      1245.4      111.9   11.13   <2e-16 ***
## AGE           3311.3      125.0   26.50   <2e-16 ***
## LAW         -49672.0     2647.1  -18.76   <2e-16 ***
## FEMALE      -42281.4     2811.5  -15.04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 127300 on 9995 degrees of freedom
## Multiple R-squared:  0.1502, Adjusted R-squared:  0.1498 
## F-statistic: 441.6 on 4 and 9995 DF,  p-value: < 2.2e-16

How do we make sense of, e.g., the slope on FEMALE?

Example

In your small groups:

  1. Could we do better at approximating the conditional expectation function, e.g., of Earnings by Age?

  2. How would you do it?

  3. Try it out

  4. How would you apply this to the example above: ls = lm(INCEARN ~ UHRSWORK + AGE + LAW + FEMALE, acs_data)?
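One possible direction (a sketch, not the only answer): allow the CEF to be nonlinear in age, e.g., with a polynomial term or a full set of age dummies. Illustrated with simulated data (variable names hypothetical):

```r
set.seed(4)
age  <- sample(25:65, 1000, replace = TRUE)
earn <- 20000 + 4000 * age - 40 * age^2 + rnorm(1000, sd = 5000)

m_lin  <- lm(earn ~ age)               # straight line in age
m_quad <- lm(earn ~ age + I(age^2))    # quadratic in age
m_full <- lm(earn ~ factor(age))       # one dummy per age: fully flexible

summary(m_quad)$r.squared > summary(m_lin)$r.squared   # TRUE: better fit
# Applied to the slide's model (acs_data not loaded here), this might look like:
# lm(INCEARN ~ UHRSWORK + AGE + I(AGE^2) + LAW + FEMALE, acs_data)
```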